MGcount user guide

Getting started

What is MGcount

MGcount is a RNA-seq quantification tool conceived to incorporate ambiguous read alignments in a flexible way that is compatible with any biotype. This allows to extract more information from total RNA-seq datasets by simultaneous quantifying coding and non-coding transcripts, both small and long, where dealing with heterogeneous read-to-feature alignment ambiguities is key. When aligning reads to a reference genome, we distinguish between two types of ambiguous alignments:

  • Multi-mappers: reads aligning to multiple genomic locations.
  • Multi-overlaps: reads aligning to a genomic location with multiple annotated features.

To deal with the most frequent multi-overlaps (occuring when a smallRNA loci is embeded within a longRNA), MGcount hierarchically assigns reads to smallRNA and longRNA accounting for their transcript length disparity. Subsequently, MGcount models read-to-feature alignments in a graph. This is exploited to detect and report expression of sequence-similar annotated loci, where reads systhematically multi-map, as integrated features called communities.

System requirements

MGcount is built on the top of FeatureCounts, a well-known computational efficient software (Liao et. al., 2014). Please download it from the following link: http://subread.sourceforge.net/

Installation

MGcount is written in Python and is executed from the command line. You can either download the executable version (single binary file) or install it as a Python3 module.

Download and run the executable program

You can download the latest release here.

Please, save the program file to your Linux system and set the permissions to allow executing the file as program:

chmod +x mgcount

Once the file is executable, you may run the tool by calling the file from the command line with the desired arguments.

mgcount [args]

Install and run as a Python3 module

Alternatively you can install the package as a Python module:

pip3 install git+https://github.com/hitaandrea/MGcount.git

Once the package is installed, you may run the tool as a Python installed module:

python3 -m mgcount [args]

Workflow

MGcount workflow schema

MGcount workflow schema

During its execution, MGcount performs the following 3 steps:

1. Hierarchical assignation

First, a hierarchy is used to solve multi-overlapping ambiguities across biotypes during alignment-to-feature assignment based on transcript body length. Reads are assigned to genomic annotated features in three pre-defined sequential rounds: small, long_exon and long_intron. Hence, all reads with at least one alignment overlapping with an annotation in the current round are assigned to such feature and skipped in subsequent rounds.

2. Multi-loci communities recognition

To quantify multi-mapping reads, MGcount builds a directed weighted graph {G=(V,E)} where each vertex (V) is an annotated feature and a pair of directional edges (E) connect two features for which common multi-mapping reads exist. Edge weithgs are defined as the ratio of multi-mapping reads between the two vertex normalized by the total number of reads assigned to the source vertex. Vertex weights are defined as the log-transformed number of assigned alignments (Fig c). Resultant graphs structures capture the multiple-loci topologies of different RNA biotypes. Two graphs are sepparately build for smallRNA and longRNA, using the full pool of input alignments. Subsequently, highly-related features are grouped together by minimizing the map equation with the communities detection approach described by Rosvall et al., 2008 .

3. Expression matrix generation

MGcount generates an output expression matrix for each hierarchical assignation round that are appended together in a single output matrix. For each read, each alignment first gets an 1/N count, where N is the number of multi-mappers or residual multi-overlaps within biotypes that survive the hierarchical assignment. Next, counts for annotations which have been aggregated together in a community by the map equation are summed up. In this way, the systematic ambiguity in multi-mapping reads gets collapsed into a single MG community while the remaining signal is reported as fractionated counts.

Usage description

Inputs

Three inputs are required to run the program:

  • Input alignment file: a .txt file listing the paths to the .bam alignment input files by line
  • Annotations gtf file: a .gtf file containing a set of RNA feature annotations
  • Output directory path: a string specyfing the path to the directory where MGcount outputs will be stored

Configurable arguments

Configurable arguments list

Optional arguments to configure a MGcount run include:

Argument Description Default value
–paired_flag (-p) Paired end flag. If null, the assignation occurs in single-end mode. False
–strand_option (-s) Library strandness. Options available are 0: unstranded, 1: forward-stranded and 2:reverse-stranded 1
–featureCounts_path Path to featureCounts software executable file /usr/bin/featureCounts
–btyperounds_filename Optional .csv file with biotype to assignation round associations. It should be a two columns table where columnames are, in order, “biotype” and “counting_round”.
–feat_small GTF feature type entry for smallRNA reads assignation transcript
–feat_output_small GTF field name for which to summarize counts of longRNA assigned reads transcript_name
–feat_biotype_small GTF field name defining biotype for smallRNA features transcript_biotype
–ml_flag_small Multi-loci graph detection based groups flag for smallRNA features 1
–min_overlap_small Minimal feature-alignment overlapping fraction for assigning a read to a smallRNA feature 1
–feat_output_long GTF field name for which to summarize counts of longRNA assigned reads gene_name
–feat_biotype_long GTF field name defining biotype for longRNA features gene_biotype
–min_overlap_long Minimal feature-alignment overlapping fraction for assigning a read to a longRNA feature 1
–ml_flag_long Multi-loci graph detection based groups flag for longRNA features 1
–th_low Low minimal threshold of feature-to-feature multi-mapping fraction. 0.01
–th_high High minimal threshold of feature-to-feature multi-mapping fraction. 0.75
–subs Optional sub-sapling number of alignments to train the multigraph network. If 0, include all. 0
–n_cores (-T) Number of cores for parallelisation by sample 1
–sample_id SampleID input file names None
–seed Optional fixed seed for random numbers generation during communities detection

Configurable arguments details

Sequencing data type

Two arguments need to be set following the type of input data. For a correct interpretation of RNA-seq data during assignation, the integer argument –strand_option need to be set according to the strandness of the library preparation method utilized (0:unstranded, 1: forward-stranded, 2:reverse-stranded). If dealing with paired reads, –paired_flag should be added to the command line call.

Multi-core mode

MGcount may process samples in parallel in all three steps of the workflow (hierarhical assignation, multigraph generation and count matrix building). The number of CPUs to be used by MGcount can be defined with the –n_cores option. We only recommend this if runing the tool in a server or a high performance computing seting due to high RAM consumption in parallel mode.

Assignation rounds configuration

Round feature feature_output feature_biotype min_overlap ml_flag
small transcript transcript_name transcript_biotype 1 True
long_exon exon gene_name gene_biotype 1 True
long_intron gene gene_name gene_biotype 1 True

At each round of the hierarchical assignation, MGcount extracts the set of annotations in the gtf with entry type –feature whose –feature_biotype attribute is included in the list of biotypes associated to the round (as defined by the csv file –btyperounds_filename). Subsequently, alignments are assigned to the restricted set of annotations whenever a minimum read fraction (as defined by –min_overlap) overlaps with an annotated feature of the extracted annotations subset.

If featureCounts software is not accessible on the system path (/usr/bin/), the full path to the software should be set through –featureCounts_path.

The association of diferent biotypes to either “long” or “small” assignation rounds can be customized in a csv file and parsed with –btypecrounds_filename argument to MGcount. The csv file must be a two columns table with names biotype and assignation_round.

By default, MGcount utilizes a csv file embeded with the program (or alternative installed with the Python module), that is located in the /mgcount/data subfolder of the Github repository. This table links the set of biotypes encountered in the 4 integrated gtf provided (Arabidopsis, Human, Mouse and Nematode) to the corresponding pre-defined small and long rounds.

For runing MGcount in further species or different annotations set, please make sure the biotypes you want to include in the quantification are correctly listed in this table for MGcount to recognize them.

At each round of the hierarchical assignation, alignment-feature assignation pairs are determined with FeatureCounts restricted to the designated subset of the .gtf annotated features. Each round can be configured by the user through the following five arguments:

Communities detection

To speed-up computation time, a fixed number of random sub-sampled alignments per sample can be set to build the graph through –subs.

The –seed argument may be set to guarantee exact solutions accross runs. The seed will be used as the starting point to generate random numbers during the communities detection approach.

MGcount ignores weak edges during map equation optimization based on a high threshold –th_high (by default 0.75) and a low threshold –th_low (by default 0.01) to prevent for over-fitting (splitting of large densely connected communities and merging of small loosely connected communities). Thresholds are empoyed as follows according to the type of graph:

  • LongRNA graph: All edges whose weight are below the high threshold are ignored for the longRNA graph. This avoids collapsing together certain features sharing only partial similarity. Given the long body length, multi-mappers may occur in only a specific part of the locus. In this situations, the threshold determines how large should the shared reads proportion between two features to be considered for a community. Lower thresholds will tend to aggregate low ambiguity in commuinties while high threholds will force features to remain single by spliting the multi-mapping reads as a fraction.

  • SmallRNA graph: The use of the high threshold or low threshold for weak edges filtering deppends on the edge weights distribution in the smallRNA graph. Here, for each biotype (microRNA, piRNA, snRNA, …), the threshold that is closer to the graph weights’ first quantile is employed. In this way, for biotypes were repeated loci are identical or nearly identical (f.i. microRNA), only high weights above the high threshold will be considered for communities while for biotypes with large groups of similar loci (snRNA, YRNA and pseudogenes, ….), all weights will be considered.

Outputs

At the end of its execution, MGcount provides the following outputs:

  • Count matrix: Expression estimation matrix where each row corresponds to a transcript feature and each column corresponds to 1 input .bam file.
  • Features metadata: It in, which match row names in the count matrix; the counting round when assigned; a flag designing whether a feature is a unique annotation or a MG community; and the feature biotype.
  • Multi-mapping graph adjacency matrix: Adjacency matrix of links between annotated features allowing to generate the smallRNA or/and longRNA multi-mapping graph, encoded as sparse Matrix in MatrixMarket standard format.
  • Multi-mapping graph communities (MGcommunities): Table linking each original feature in the gtf with the resultant feature matching count matrix and feature metadata feature identifiers. It includes both unique features (that remain unmodified) and aggregated features (that are collapsed following MG communities). In addition, the table provides with the total number of alignments per feature.

Tutorials

T1 - Prepare inputs and execute

In the following tutorials, we use two Human Brain samples as example to walk through MGcount execution. First of all, let’s create a folder and download the aligment bam files of the two samples. (The downloading process might take a few minutes).

mkdir mgcount_tutorial
cd mgcount_tutorial
wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/humanbrain_bamfiles.zip
unzip humanbrain_bamfiles.zip -d input_bamfiles

To run MGcount, we need to provide the software with a .txt file specifying the paths to the input aligment files. Here, these are the two .bam alignment files we just downloaded. We can generate this file from the command line:

printf "$PWD/%s\n" input_bamfiles/* > input_bamfilenames.txt

The other required input is a .gtf file with transcript features annotations. MGcount repository provides with four ready gtf files integrating annotations from several databases. Next, we will download them and use the Human annotations file.

wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/integrated_annotations_gtf.zip
unzip integrated_annotations_gtf.zip -d annotations_gtf

Once we have both the gtf and the input .bam files, we can simply run MGcount as a python3 module. A python module is run by calling python3 with the parameter -m and the name of the module. After this, we can specify MGcount arguments.

For this example, we parse the ready-integrateg gtf for the Human genome, the txt file we just created containing the path to the input aligment .bam files and a string designating the directory where MGcount outputs will be generated.

python3 -m mgcount --gtf annotations_gtf/Homo_sapiens.GRCh38.gtf --outdir outputs --bam_infiles input_bamfilenames.txt

If you have been able to run MGcount successfully, your output directory should contain 6 new files generate by MGcount including the RNA count matrix, the feature metadata table and two files containing the graph structure and the communities detected for longRNA and smallRNA respectively.

We are done!

T2 - Explore quantification outputs

In this tutorial, we will load the MGcount outputs we obtained in tutorial 1 and we will explore them from R. MGcount repository contains a few supporting scripts in R providing with side functionalities (manage annotations, visualize MGcount outputs…). It is possible to download the sole folder containing these scripts with the following shell command:

cd mgcount_tutorial
svn export https://github.com/hitaandrea/MGcount/trunk/R

Once the scripts are downloaded, we are ready to launch R. Please, start an R session and define the tutorial root directory as a variable.

Further, this tutorial uses the following R packages. Please, install them with install.packages() if you wish to run this tutorial on your system.

## Warning in fun(libname, pkgname): couldn't connect to display ":0"

The main output of MGcount is the count_matrix.csv file containing the feature by sample expression matrix. We can import it into R as any regular csv.

Each row in the count table is a feature for which expression has been quantified and each column is associated to a sample. By interrogating for the matrix dimension, we see the matrix contains two columns corresponding to the two human brain libraries. The matrix is ready to be used in any RNA-seq downstream analysis.

## [1] 47539     2

We can look at the total number of reads assigned from each library by suming up each of the two rows. The mean counts over the two libraries is approximately 4.2 million.

## Human_Brain_total_100ng_1 Human_Brain_total_100ng_2 
##                   3837246                   4578696
## [1] "Mean counts: 4207971.05"

Let’s import now the feature_metadata output. This table reports feature-related attributes such as the assignation round to which the annotation belongs, a flag stating wheter the feature is an aggregated community of annotations or an individual feature and the biotype associated to the feature. The feature identifiers in the counts_matrix and the feature_metadata match and therefore, this table can facilitate the extraction of particular features from the count matrix, e,g, a certain biotype, the exonic counts, the subset of features aggregated in communities, etc…)

Here, for example, we profit from the feature_metadata table to look at counts distribution by assignation round.

Also, we can exploit the feature_metadata table to extract stats from the MGcount run. Here we look at the proportion of aggregated features (community_flag variable) by biotype (feature_biotype). The output shows how smallRNA biotypes tend to be more aggregated in communities because duplicated loci are more frequent.

False True Total
Hybrid 18 8 26
IG_C_gene 8 0 8
IG_V_gene 6 0 6
IG_V_pseudogene 5 0 5
lncRNA 10835 192 11027
miRNA 504 30 534
misc_RNA 279 9 288
Mt_rRNA 2 0 2
Mt_tRNA 22 0 22
piRNA 957 68 1025
polymorphic_pseudogene 16 0 16
processed_pseudogene 2500 70 2570
protein_coding 24446 5026 29472
ribozyme 3 0 3
rRNA 6 2 8
rRNA_pseudogene 104 17 121
scaRNA 15 5 20
snoRNA 178 99 277
snRNA 250 22 272
TR_C_gene 8 0 8
TR_V_gene 12 0 12
TR_V_pseudogene 3 0 3
transcribed_processed_pseudogene 242 9 251
transcribed_unitary_pseudogene 147 2 149
transcribed_unprocessed_pseudogene 667 20 687
tRNA 46 23 69
unitary_pseudogene 31 0 31
unprocessed_pseudogene 619 8 627
Total 41929 5610 47539
False True Total
Hybrid 0.6923077 0.3076923 1
IG_C_gene 1.0000000 0.0000000 1
IG_V_gene 1.0000000 0.0000000 1
IG_V_pseudogene 1.0000000 0.0000000 1
lncRNA 0.9825882 0.0174118 1
miRNA 0.9438202 0.0561798 1
misc_RNA 0.9687500 0.0312500 1
Mt_rRNA 1.0000000 0.0000000 1
Mt_tRNA 1.0000000 0.0000000 1
piRNA 0.9336585 0.0663415 1
polymorphic_pseudogene 1.0000000 0.0000000 1
processed_pseudogene 0.9727626 0.0272374 1
protein_coding 0.8294653 0.1705347 1
ribozyme 1.0000000 0.0000000 1
rRNA 0.7500000 0.2500000 1
rRNA_pseudogene 0.8595041 0.1404959 1
scaRNA 0.7500000 0.2500000 1
snoRNA 0.6425993 0.3574007 1
snRNA 0.9191176 0.0808824 1
TR_C_gene 1.0000000 0.0000000 1
TR_V_gene 1.0000000 0.0000000 1
TR_V_pseudogene 1.0000000 0.0000000 1
transcribed_processed_pseudogene 0.9641434 0.0358566 1
transcribed_unitary_pseudogene 0.9865772 0.0134228 1
transcribed_unprocessed_pseudogene 0.9708879 0.0291121 1
tRNA 0.6666667 0.3333333 1
unitary_pseudogene 1.0000000 0.0000000 1
unprocessed_pseudogene 0.9872408 0.0127592 1
Total 0.8819916 0.1180084 1

Below, we display a few random rows from the table for illustration.

feature assignation_round annotations_subset feature_type feature_output feature_biotype community_flag
1651 RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P small small transcript transcript_name snRNA True
1842 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc small small transcript transcript_name snoRNA True
2140 hsa-miR-137-3p small small transcript transcript_name miRNA False
1016 PIRNA54381 small small transcript transcript_name piRNA False
15607 HMGA1-exon long_exon long exon gene_name protein_coding True
20339 NUCKS1-exon long_exon long exon gene_name protein_coding False
36520 GATD3A_GATD3B_FP565260.2-intron long_intron long gene gene_name protein_coding True
36140 FCGRT-intron long_intron long gene gene_name protein_coding False

Next we show how to generate a barplot showing the read distribution by biotype group. First lets load a table to group biotypes by category. We will use this to group less abundant biotypes in larger groups for visualization purposes.

We then combine feature_metadata table with the count_matrix again and sum up the counts to get the total number of reads by biotype. We do this sepparately by biotype groups and small-non-coding individual biotypes to further represent the small non-coding spectrum.

Once we have extracted the counts by biotype group, let’s employ ggplot to visualize the read distribution profiles as barplots.

Reads distribution by biotype:

Finally, we import the multigraph_communities tables. These tables link each original feature in the gtf with the resultant feature matching the count matrix and feature metadata identifiers. It includes both unique features (that remain unmodified) and aggregated features (that are collapsed following MG communities). Thus, we can track back the features groupped in each aggregated feature.

Here we will use the feature_metadata subset from before to explore the original features in the gtf forming each new MG community feature.

transcript_name transcript_biotype naln naln_community community_flag community_id community_name community_biotype feature
262 RNU6-1310P snRNA 1 56 True snRNA-cl-32 RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P snRNA RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P
645 RNU6-788P snRNA 1 56 True snRNA-cl-32 RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P snRNA RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P
900 RNU6-838P snRNA 45 56 True snRNA-cl-32 RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P snRNA RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P
2082 RNU6-1238P snRNA 1 56 True snRNA-cl-32 RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P snRNA RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P
2147 RNU6-1123P snRNA 8 56 True snRNA-cl-32 RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P snRNA RNU6-838P_RNU6-1123P_RNU6-1310P_RNU6-788P_RNU6-1238P
2754 SNORD115-1 snoRNA 2429 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2755 SNORD115-2 snoRNA 600 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2756 SNORD115-3 snoRNA 1566 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2757 SNORD115-4 snoRNA 2465 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2758 SNORD115-5 snoRNA 3984 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2759 SNORD115-6 snoRNA 2701 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2760 SNORD115-7 snoRNA 383 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2761 SNORD115-8 snoRNA 1815 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2762 SNORD115-9 snoRNA 5646 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2763 SNORD115-10 snoRNA 4233 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2764 SNORD115-11 snoRNA 5787 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2765 SNORD115-12 snoRNA 5646 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2766 SNORD115-13 snoRNA 2602 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2767 SNORD115-14 snoRNA 2600 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2768 SNORD115-15 snoRNA 4024 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2769 SNORD115-16 snoRNA 3272 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2770 SNORD115-17 snoRNA 4093 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2771 SNORD115-18 snoRNA 4093 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2772 SNORD115-19 snoRNA 4093 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2773 SNORD115-20 snoRNA 3539 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2774 SNORD115-21 snoRNA 2801 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2775 SNORD115-22 snoRNA 3958 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2776 SNORD115-23 snoRNA 1982 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2777 SNORD115-24 snoRNA 973 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2778 SNORD115-25 snoRNA 778 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2779 SNORD115-26 snoRNA 3029 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2780 SNORD115-27 snoRNA 43 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2781 SNORD115-28 snoRNA 137 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2782 SNORD115-29 snoRNA 5787 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2783 SNORD115-30 snoRNA 1630 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2784 SNORD115-31 snoRNA 315 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2785 SNORD115-32 snoRNA 718 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2786 SNORD115-33 snoRNA 1928 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2787 SNORD115-34 snoRNA 3814 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2788 SNORD115-35 snoRNA 399 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2789 SNORD115-36 snoRNA 5787 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2790 SNORD115-37 snoRNA 97 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2791 SNORD115-38 snoRNA 2592 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2792 SNORD115-39 snoRNA 3541 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2793 SNORD115-40 snoRNA 3464 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2794 SNORD115-41 snoRNA 3288 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2795 SNORD115-42 snoRNA 3956 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2796 SNORD115-43 snoRNA 5787 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2797 SNORD115-44 snoRNA 2436 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
2799 SNORD115-48 snoRNA 629 125440 True snoRNA-cl-214 SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc snoRNA SNORD115-29_SNORD115-11_SNORD115-36_SNORD115-43_SNORD115-12_SNORD115-9_SNORD115-10_SNORD115-19_etc
4552 PIRNA54381 piRNA 36 36 False PIRNA54381
5508 hsa-miR-137-3p miRNA 79 79 False hsa-miR-137-3p
gene_name gene_biotype naln naln_community community_flag community_id community_name community_biotype feature
2790 NUCKS1 protein_coding 1714 1714 False NUCKS1
8352 HMGA1P2 processed_pseudogene 4 78 True long-cl-6754 HMGA1 protein_coding HMGA1
11087 AL139095.4 lncRNA 1 78 True long-cl-6754 HMGA1 protein_coding HMGA1
11653 HMGA1 protein_coding 65 78 True long-cl-6754 HMGA1 protein_coding HMGA1
14862 HMGA1P1 processed_pseudogene 4 78 True long-cl-6754 HMGA1 protein_coding HMGA1
23575 HMGA1P3 processed_pseudogene 1 78 True long-cl-6754 HMGA1 protein_coding HMGA1
24010 HMGA1P6 processed_pseudogene 3 78 True long-cl-6754 HMGA1 protein_coding HMGA1
gene_name gene_biotype naln naln_community community_flag community_id community_name community_biotype feature
15793 AL672032.1 lncRNA 1 250 True long-cl-11813 GATD3A_GATD3B_FP565260.2 protein_coding GATD3A_GATD3B_FP565260.2
33419 FCGRT protein_coding 53 53 False FCGRT
34841 GATD3B protein_coding 80 250 True long-cl-11813 GATD3A_GATD3B_FP565260.2 protein_coding GATD3A_GATD3B_FP565260.2
34842 FP565260.2 protein_coding 71 250 True long-cl-11813 GATD3A_GATD3B_FP565260.2 protein_coding GATD3A_GATD3B_FP565260.2
35199 GATD3A protein_coding 98 250 True long-cl-11813 GATD3A_GATD3B_FP565260.2 protein_coding GATD3A_GATD3B_FP565260.2

We are done!

T3- Explore output multi-graph

Below, we use a few functions defined in “mg_visualize.R” to graphically explore the multi-mapping graph topology.

The function ‘mg_build’ takes the Multi-Graph adjacency matrix and the MG communities and creates an igraph object. The next two functions provide with the code to generate a few default plots given a Multi-Graph igraph object.

As an example, we will load the smallRNA Multi-Graph adjacency matrix and we will subset it to explore the sub-graph associated to microRNA features.

The adjacency matrix is stored as a sparse symetric matrix in MatrixMarket format, which can be imported in R by the readMM function from ‘Matrix’ R package. We also import the table containing all annotated features and MG communities outputs.

Next we subset the matrix by selecting the features under ‘miRNA category’. Each row in the Multi-Graph communities table table corresdponds to a feature annotation with non-zero assignments in the small round. The row index of each feature defines its column and row position in the Multi-Graph adjacency matrix.

Necessary steps to convert the adjacency matrix and the communities table to an igraph object are in the mg_build function.

We can explore the patterns in the feature gene symbols established by the HGNC and link that to specific colors during visualization

The function mg_plotset generates a set of different plots of the Multi-Mapping graph and stores them as png files with a user-given file prefix (plotfile variable).

Raw visualization of the graph

Visualization of the graph colored by -3/-5 patterns in microRNA transcript symbol

Visualization of the graph colored by -3/-5 patterns in microRNA transcript symbol and detected communities

We are done!

Integrating annotation sources

MGcount is a tool conceived to analyse heterogeneous datasets capturing diverse non-coding transcript types but the scope of features quantified by MGcount is bounded by the features annotated in the reference gtf file used as input. On this line, altough MGcount can be executed with any gtf annotations file (Ensembl, Gencode, etc…), we provide with the option to integrate annotations from several databases to take into account a more complete or/and up-to-date annotations set in the quantification, specially for small regulatory RNAs (piRNA, tRF, microRNA, siRNA). These transcripts are not necessarily annotated in general databases.

In tutorial 1, we used a ready integrated gtf for the human genome. Besides, the directory integrated_annotations_gtf we downloaded in tutorial 1, provides with integrated annotations gtf files for Arabidopsis, Mouse and Nematode that can be used for runing MGcount in the corresponding spcies. The following databases have been integrated for each specie:

  • Arabidopsis thaliana: Ensembl, miRBase (microRNA) and RNACentral (siRNA);
  • Homo Sapiens Ensembl, DASHR (piRNA and tRNA fragments [tRF]) and miRBase (microRNA);
  • Mus musculus: Ensembl, miRBase (microRNA) and RNAcentral (piRNA);
  • Caenorhabditis elegans: Ensembl and miRBase (microRNA).

For runing MGcount in any other genome, we encourage to follow the same procedure we followed to generate the last 4 gtfs. The script we used to generate the last 4 gtf is provided in the R folder from the MGcount github repository (that can be individually download with the command in tutorial 2). We hope the script can be used as a template for custom gtf integration in other species.